1. Technical Project Summary

Knowledge  base  refinement  is  the  modification  of  an  existing  expert  system  knowledge  base 
with  the  goals  of  localizing  specific  weaknesses  in  a  knowledge  base  and  improving  an  expert 
system's  performance.  Systems  that  automate  some  aspects  of  knowledge  base  refinement  can 
have  a  significant  impact  on  the  related  problems  of  knowledge  base  acquisition,  maintenance, 
verification,  and  learning  from  experience.  The  SEEK  system  was  the  first  expert  system 
framework  to  integrate  large-scale  performance  information  into  all  phases  of  knowledge  base 
development  and  to  provide  automatic  information  about  rule  refinement.  A  recently  developed 
successor  system,  SEEK2  [Ginsberg,  Weiss,  and  Politakis  88]  significantly  expands  the  scope  of  the 
original  system  in  terms  of  generality  and  automated  capabilities.  The  investigators  expect  to 
make  significant  progress  in  automating  empirical  expert  system  techniques  for  knowledge 
acquisition, knowledge base refinement, maintenance, and verification.


2.  Principal  Expected  Innovations 

The  investigators  will  demonstrate  a  rule  refinement  system  in  an  application  of  the  diagnosis  of 
complex  equipment  failure:  computer  network  troubleshooting.  The  expert  system  should 
demonstrate  the  following  advanced  capabilities: 

•  automatic  localization  of  knowledge  base  weaknesses 

•  automatic  repair  (refinement)  of  poorly  performing  rules 

•  automatic  verification  of  new  knowledge  base  rules 

• automatic learning capabilities


3.  Objectives  for  FY89 

These are our objectives for the current fiscal year (FY89):


• full demonstration of refinement system, using subset of DEC's Network
Troubleshooting Consultant (NTC). System will automatically recover from many
forms of damage to knowledge base.

• full demonstration of system with capabilities for automatic refinement and
verification of knowledge base consistency. Empirical experiments will be performed
and results will be reported.

• demonstration of significant automated rule learning capabilities.

• demonstration of extended system capabilities for alternative control strategies and
representations.

• completed comparative studies of empirical techniques for machine learning, statistical
pattern recognition, and neural nets.


4.  Summary  of  Progress 

During  the  previous  year  the  following  was  accomplished: 


•  initial  functioning  equipment  diagnosis  and  repair  knowledge  base,  suitable  for 
refinement. This is a subset of DEC's Network Troubleshooting Consultant (NTC).

•  initial  demonstration  of  functioning  equipment  diagnostic  system  with  capabilities  of 
localization  of  weak  rules,  automatic  refinement,  automatic  verification. 

•  demonstration  of  initial  rule  learning  capabilities. 

•  development  of  case  generation  simulator  and  randomized  rule  modifier. 

•  initial  comparative  studies  demonstrating  superiority  of  PVM  rule  induction 
procedure. 

This  work  is  the  basis  for  further  progress  in  developing  an  automated  refinement  system.  We 
are  pursuing  the  refinement  and  learning  tasks  from  both  an  expert  system  rule-based  perspective 
and  a  machine  learning  rule  induction  perspective.  In  order  to  develop  the  strongest  form  of 
refinement system, we have examined numerous techniques for empirical rule induction. We have
also developed a procedure, Predictive Value Maximization, that shows strong results for
induction  of  single  relatively  short  rules.  Our  fundamental  objective  is  to  mix  the  best  rule 
induction  procedures  with  a  rule-based  expert  system  to  achieve  the  strongest  empirical  results. 

Here  are  the  highlights  of  new  progress  in  meeting  our  stated  objectives  for  fiscal  year  89: 


•  We  have  completed  an  extensive  empirical  comparison  of  machine  learning  rule 
induction  techniques  with  statistical  pattern  recognition  techniques,  and  neural  nets. 
Four  real-world  data  sets  were  analyzed  using  different  techniques.  The  study  required 
over  6  months  of  Sun  4  CPU  time.  The  results  are  described  in  a  completed  paper  that 
will be published and presented at the 1989 International Joint Conference on Artificial
Intelligence.

•  We  have  completed  a  procedure  for  the  refinement  system  that  uses  rule  induction 
techniques. This procedure gives the refinement system a learning capability, which is
the  most  difficult  and  important  of  our  major  research  objectives  for  this  fiscal  year. 


The fundamental approach of rule refinement is to constrain changes that can be made to the
knowledge base to those that are fully consistent with the rules of the expert-supplied knowledge
base. Unlike a refinement system, a pure learning system, such as a rule induction system, attempts
to  learn  directly  from  data,  unconstrained  by  human  expert  knowledge.  A  more  constrained 
learning  approach  maintains  the  expert  supplied  rules  but  allows  for  some  additions  to  the  rules. 
The  new  learning  procedures  added  to  the  refinement  system  use  generalization  and  specialization 
models  to  perform  2  functions: 

•  add  a  variable  to  a  rule  to  specialize  the  rule 

•  add  a  new  rule  to  the  knowledge  base  to  generalize  the  rule 

The  procedure  for  adding  components  and  rules  will  be  detailed  in  the  next  report.  Some  key 
parts  of  the  procedure  are  analogous  to  current  tree  generation  procedures  such  as  ID3/C4  or 
CART,  where  the  split  is  performed  on  the  single  best  node.  In  our  case  during  a  given  refinement 
cycle,  we  attempt  to  induce  the  single  best  variable  and  decision  threshold.  The  following 
preliminary  results  were  found  for  a  knowledge  base  of  100  rules  and  5  endpoints  that  previously 
was  refined  from  a  performance  of  73%  (88/121)  to  100%  (121/121). 

•  The  same  100%  refinement  performance  was  achieved  with  the  learning  capability. 

•  When  all  100  rules,  with  an  average  of  4  variables  per  rule,  were  deleted  from  the 
knowledge  base,  the  system  was  able  to  generate  14  rules  and  21  variables  that 
achieved  88%  (107/121)  correct  classification. 

While  these  results  are  preliminary,  they  demonstrate  the  potential  for  robust  mixed  knowledge 
base  refinement  and  learning  procedures. 
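As an illustration of the induction step described above, in which a refinement cycle looks for the single best variable and decision threshold, the following is a minimal sketch and not the refinement system's actual procedure. The function name, the use of midpoints between observed values as candidate thresholds, and the choice of raw classification accuracy as the selection criterion are all assumptions made for this example.

```python
from typing import Sequence, Tuple


def best_variable_and_threshold(
    cases: Sequence[Sequence[float]], labels: Sequence[bool]
) -> Tuple[int, float, int]:
    """Return (variable index, threshold, number correct) for the single
    test `value > threshold` that classifies the most cases correctly.

    Hypothetical helper: only the "greater than" direction is searched.
    """
    best = (-1, 0.0, -1)
    for var in range(len(cases[0])):
        values = sorted({case[var] for case in cases})
        # Candidate thresholds: midpoints between adjacent distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2.0
            correct = sum(
                (case[var] > threshold) == label
                for case, label in zip(cases, labels)
            )
            if correct > best[2]:
                best = (var, threshold, correct)
    return best


if __name__ == "__main__":
    toy_cases = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
    toy_labels = [False, False, True, True]
    print(best_variable_and_threshold(toy_cases, toy_labels))
```

Like the tree generation procedures mentioned above, this sketch commits greedily to one variable at a time; a full refinement cycle would add further constraints from the expert-supplied rules.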


5.  Financial  Review 

1. Basic contract dollar amount: $536,919 (9/1/87-8/31/89)

2. Dollar amounts and purposes of options: None

3. Total spending authority received to date: $475,000 through 1/31/89

4. Total spending to date: $351,559 through 5/31/89

5. Monthly expenditure rate: We anticipate funding larger portions of the summer
salaries of the principal investigators over the coming summer as well as more
systems programmer salary (in light of our increased effort being devoted during the
summer to the research project). We have also brought on board one more graduate
assistant to assist in this research, resulting in higher salary expenditures anticipated
throughout the current (1988-89) academic year and summer of 1989.

6. We have funded a total of approximately $351,559 to date. This would, therefore,
result in an average monthly expenditure rate of $17,578.

7.  Major  non-salary  expenditures  planned  within  this  increment  of  funding:  None 

8.  Date  next  increment  of  funds  is  needed:  January,  1989. 

I,  Technical  Report 

A  paper  entitled  An  Empirical  Comparison  of  Pattern  Recognition,  Neural  Nets,  and  Machine  Learning 
Classification  Methods,  to  be  published  and  presented  at  the  1989  International  Joint  Conference  on 
Artificial  Intelligence,  is  enclosed  with  this  quarterly  report  [Weiss  and  Kapouleas  89]. 
Abstract

Classification  methods  from  statistical  pattern 
recognition,  neural  nets,  and  machine  learning  were 
applied  to  four  real-world  data  sets.  Each  of  these  data 
sets  has  been  previously  analyzed  and  reported  in  the 
statistical, medical, or machine learning literature. The
data sets are characterized by statistical uncertainty;
there  is  no  completely  accurate  solution  to  these 
problems.  Training  and  testing  or  resampling 
techniques  are  used  to  estimate  the  true  error  rates  of 
the  classification  methods.  Detailed  attention  is  given 
to  the  analysis  of  performance  of  the  neural  nets  using 
back  propagation.  For  these  problems,  which  have 
relatively  few  hypotheses  and  features,  the  machine 
learning  procedures  for  rule  induction  or  tree  induction 
clearly performed best.^1

1  Introduction 

Many  decision-making  problems  fall  into  the  general 
category of classification [Clancey, 1985, Weiss and
Kulikowski, 1984, James, 1985]. Diagnostic decision
making is a typical example. Empirical learning techniques
for classification span roughly two categories: statistical
pattern recognition [Duda and Hart, 1973, Fukunaga, 1972]
(including neural nets [McClelland and Rumelhart, 1988])
and machine learning techniques for induction of decision
trees or production rules. While a method from either
category is usually applicable to the same problem, the two
categories of procedures can differ radically in their
underlying models and the final format of their solution.
Both approaches to (supervised) learning can be used to
classify a sample pattern (example) into a specific class.
However,  a  rule-based  or  decision  tree  approach  offers  a 
modularized,  clearly  explained  format  for  a  decision,  and  is 
compatible  with  a  human’s  reasoning  procedures  and  expert 
system  knowledge  bases. 

Statistical  pattern  recognition  is  a  relatively  mature  field. 
Pattern  recognition  methods  have  been  studied  for  many 
years,  and  the  theory  is  highly  developed  [Duda  and  Hart, 
1973,  Fukunaga,  1972].  In  recent  years,  there  has  been  a 
surge  in  interest  in  newer  models  of  classification, 
specifically  methods  from  machine  learning  and  neural  nets. 

Methods  of  induction  of  decision  trees  from  empirical 
data  have  been  studied  by  researchers  in  both  artificial 
intelligence  and  statistics.  Quinlan’s  ID3  [Quinlan, 
1986]  and  C4  [Quinlan,  1987a]  procedures  for  induction  of 
decision  trees  are  well  known  in  the  machine  learning 
community.  The  Classification  and  Regression  Trees 
(CART)  [Breiman,  Friedman,  Olshen,  and  Stone, 
1984]  procedure  is  a  major  nonparametric  classification 
technique  that  was  developed  by  statisticians  during  the 
same period as ID3. Production rules are related to decision
trees; each path in a decision tree can be considered a
distinct production rule. Unlike decision trees, a disjunctive
set of production rules need not be mutually exclusive.
Among the principal techniques of induction of production
rules from empirical data are Michalski's AQ15
system [Michalski, Mozetic, Hong, and Lavrac, 1986] and
recent work by Quinlan in deriving production rules from a
collection of decision trees [Quinlan, 1987b].

^1 This research was supported in part by ONR Contract N00014-87-K-0398
and NIH Grant P41-RR02230.

Neural net research activity has increased dramatically
following  many  reports  of  successful  classification  using 
hidden  units  and  the  back  propagation  learning  technique. 
This  is  an  area  where  researchers  are  still  exploring  learning 
methods,  and  the  theory  is  evolving. 

Researchers from all these fields have explored similar
problems using different classification models.
Occasionally, some classical discriminant methods are cited
in comparison with results for a newer technique, such as a
comparison of neural nets with nearest neighbor techniques.
In  this  paper,  we  report  on  results  of  an  extensive 
comparison  of  classification  methods  on  the  same  data  sets. 
Because  of  the  recent  heightened  interest  in  neural  nets,  and 
in  particular  the  back  propagation  method,  we  present  a 
more  detailed  analysis  of  the  performance  of  this  method. 
We  selected  problems  that  are  typical  of  many  applications 
that  deal  with  uncertainty,  for  example  medical  applications. 
In  such  problems,  such  as  determining  who  will  survive 
cancer,  there  is  no  completely  accurate  answer.  In  addition, 
we  may  have  a  relatively  small  data  set.  An  analysis  of  each 
of  the  data  sets  that  we  examined  has  been  previously 
published  in  the  literature. 

2  Methods 

We  are  given  a  data  set  consisting  of  patterns  of  features  and 
correct  classifications.  This  data  set  is  assumed  to  be  a 
random  sample  from  some  larger  population,  and  the  task  is 
to classify new patterns correctly. The performance of each
method is measured by its error rate. If unlimited cases for
training and testing are available, the error rate can readily
be obtained as the error rate on the test cases. Because we
have far fewer cases, we must use resampling techniques for
estimating error rates. These are described in the next
section.^2

2.1.  Estimating  Error  Rates 

It is well known that the apparent error rate of a classifier
on all the training cases^3 can lead to highly misleading and
usually over-optimistic estimates of performance [Duda and
Hart, 1973]. This is due to overspecialization of the
classifier to the data.^4

^2 While there has been much recent interest in the "probably
approximately correct" (PAC) theoretical analysis for both rule
induction [Valiant, 1985, Haussler, 1988] and neural nets [Baum, 1989],
the PAC analysis is a worst case analysis to guarantee for all possible
distributions that results on a training set are correct to within a small
margin of error. For a real problem, one is given a sample from a single
distribution, and the task is to estimate the true error rate. This type of
analysis requires far fewer cases, because only a single albeit unknown
distribution is considered and independent cases are used for testing.

^3 This is sometimes referred to as the resubstitution or reclassification
error rate.

Techniques for estimating error rates have been widely
studied in the statistics [Efron, 1982] and pattern
recognition [Duda and Hart, 1973, Fukunaga, 1972]
literature. The simplest technique for "honestly" estimating
error rates, the holdout or H method, is a single train and
test experiment. The sample cases are broken into two
groups of cases: a training group and a test group. The
classifier is independently derived from the training cases,
and the error estimate is the performance of the classifier on
the test cases. A single random partition of train and test
cases can be somewhat misleading.^5 The estimated size of
the test sample needed for a 95% confidence interval is
described in [Highleyman, 1962]. With 1000 independent
test cases, one can be virtually certain that the error rate on
the test cases is very close to the true error rate.
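To give a rough sense of why about 1000 independent test cases are enough, the sketch below uses the normal approximation to the binomial distribution. That approximation is an assumption of this example and is not necessarily the calculation given in [Highleyman, 1962]; the function name is hypothetical.

```python
import math


def error_rate_half_width(error_rate: float, n_test: int, z: float = 1.96) -> float:
    """Approximate half-width of a confidence interval around an error rate
    measured on n_test independent test cases (z = 1.96 for about 95%).

    Assumption: normal approximation to the binomial.
    """
    return z * math.sqrt(error_rate * (1.0 - error_rate) / n_test)


if __name__ == "__main__":
    for n in (50, 100, 1000):
        print(n, round(error_rate_half_width(0.5, n), 3))
```

Under this approximation, even the worst case (a true error rate of 0.5) with 1000 test cases gives a half-width of roughly three percentage points, whereas 100 test cases leave an uncertainty of nearly ten points.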

Instead  of  relying  on  a  single  train  and  test  experiment, 
multiple  random  test  and  train  experiments  can  be 
performed.  For  each  random  train  and  test  partition,  a  new 
classifier  is  derived.  The  estimated  error  rate  is  the  average 
of the error rates for classifiers derived for the independently
and randomly generated partitions. Random resampling can
produce better error estimates than a single train and test
partition. 
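The random resampling procedure described above can be sketched as follows. This is an illustrative outline only: the function name, the default of 10 partitions, the 30% test fraction, and the convention that `train` returns a classifying function are assumptions of the example, not details taken from the study.

```python
import random
from typing import Callable, List, Sequence, Tuple

Case = Tuple[Sequence[float], str]
Classifier = Callable[[Sequence[float]], str]


def random_resampling_error(
    cases: Sequence[Case],
    train: Callable[[List[Case]], Classifier],
    n_partitions: int = 10,
    test_fraction: float = 0.3,
    seed: int = 0,
) -> float:
    """Average test error over several random train and test partitions."""
    rng = random.Random(seed)
    n_test = max(1, int(round(len(cases) * test_fraction)))
    rates = []
    for _ in range(n_partitions):
        shuffled = list(cases)
        rng.shuffle(shuffled)
        test, training = shuffled[:n_test], shuffled[n_test:]
        # A new classifier is derived for every partition.
        classify = train(training)
        errors = sum(classify(features) != label for features, label in test)
        rates.append(errors / len(test))
    return sum(rates) / len(rates)
```

Setting `n_partitions=1` reduces this to the single holdout experiment described earlier.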

A special case of resampling is known as
leaving-one-out [Fukunaga, 1972, Efron, 1982]. Leaving-One-Out
is an elegant and straightforward technique for
estimating classifier error rates. Because it is
computationally expensive, it is often reserved for relatively
small samples. For a given method and sample size n, a
classifier is generated using n-1 cases and tested on the
remaining case. This is repeated n times, each time
designing a classifier by leaving-one-out. Each case is used
as a test case, and each time nearly all the cases are used to
design a classifier. The error rate is the number of errors on
the single test cases divided by n.
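A minimal sketch of the leaving-one-out loop just described is given below. The function names and the majority-class trainer used to exercise it are hypothetical stand-ins; any of the classification methods compared in this paper could be substituted for `train`.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Case = Tuple[Sequence[float], str]
Classifier = Callable[[Sequence[float]], str]


def leave_one_out_error(
    cases: Sequence[Case], train: Callable[[List[Case]], Classifier]
) -> float:
    """Train on n-1 cases, test on the remaining case, repeat n times."""
    cases = list(cases)
    errors = 0
    for i, (features, label) in enumerate(cases):
        training = cases[:i] + cases[i + 1:]
        classify = train(training)
        if classify(features) != label:
            errors += 1
    return errors / len(cases)


def train_majority(training: List[Case]) -> Classifier:
    """Toy trainer (assumption): always predicts the most frequent class."""
    majority = Counter(label for _, label in training).most_common(1)[0][0]
    return lambda features: majority


if __name__ == "__main__":
    toy = [((0.0,), "a")] * 4 + [((1.0,), "b")]
    print(leave_one_out_error(toy, train_majority))
```

The cost is n complete training runs, which is why the technique is usually reserved for small samples.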

Evidence  for  the  superiority  of  the  leaving-one-out 
approach  is  well-documented  [Lachenbruch  and  Mickey, 
1968, Efron, 1982]. While leaving-one-out is a preferred
technique, with large samples it may be computationally
expensive. However, as the sample size grows, traditional
train and test methods improve their accuracy in estimating
error  [Kanal  and  Chandrasekaran,  1971]. 

The  leaving-one-out  error  technique  is  a  special  case  of 
the general class of cross validation error estimation
methods [Stone, 1974]. In k-fold cross validation, the cases
are randomly divided into k mutually exclusive test
partitions of approximately equal size. The cases not found
in each test partition are independently used for training,
and the resulting classifier is tested on the corresponding
test partition. The average error rate over all k partitions is
the cross-validated error rate. The CART procedure was
extensively tested with varying numbers of partitions and
10-fold cross validation seemed to be adequate and accurate,
particularly for large samples where leaving-one-out is
computationally expensive [Breiman, Friedman, Olshen,
and Stone, 1984]. For small samples, bootstrapping, a
method  for  resampling  with  replacement,  has  shown  much 
promise  as  a  low  variance  estimator  for  classifiers  [Efron, 
1983,  Jain,  Dubes,  and  Chen,  1987,  Crawford,  1989].  This 
is  an  area  of  active  research  in  applied  statistics. 
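The k-fold partitioning described above can be sketched as follows. The function names, the fixed random seed, and the round-robin assignment of shuffled indices to partitions are assumptions of this example; the sketch also assumes k does not exceed the number of cases.

```python
import random
from typing import Callable, List, Sequence, Tuple

Case = Tuple[Sequence[float], str]
Classifier = Callable[[Sequence[float]], str]


def k_fold_partitions(n_cases: int, k: int, seed: int = 0) -> List[List[int]]:
    """Randomly split case indices into k mutually exclusive test partitions
    whose sizes differ by at most one."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_error(
    cases: Sequence[Case],
    train: Callable[[List[Case]], Classifier],
    k: int = 10,
    seed: int = 0,
) -> float:
    """Average of the k test-partition error rates (requires k <= len(cases))."""
    rates = []
    for test_idx in k_fold_partitions(len(cases), k, seed):
        held_out = set(test_idx)
        training = [c for i, c in enumerate(cases) if i not in held_out]
        classify = train(training)
        errors = sum(classify(cases[i][0]) != cases[i][1] for i in test_idx)
        rates.append(errors / len(test_idx))
    return sum(rates) / len(rates)
```

With k equal to the number of cases, every partition holds exactly one case and the procedure coincides with leaving-one-out.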

Figure 1 compares the techniques of error estimation for a
sample of n cases. The estimated error rate is the average of
the error rates over the number of iterations. While these
error estimation techniques were known and published in
the 1960s and early 1970s, the increase in computational
speeds of computers makes them much more viable today
for larger samples and more complex classification
techniques [Steen, 1988].

Figure 1: Comparison of Techniques for Estimating Error Rates

^4 In the extreme, a classifier can be constructed that simply consists of all
patterns in the given sample. Assuming identical patterns do not belong to
different classes, this yields perfect classification on the sample cases.

^5 Empirical results also support the stratification of cases in the train and
test sets to approximate the percentage (prevalence) of each class in the
overall sample.

Besides  improved  error  estimates,  there  are  a  number  of 
significant  advantages  to  resampling.  The  goal  of  separating 
a sample of cases into a training set and testing set is to help
design  a  classifier  with  a  minimum  error  rate.  With  a  single 
train  and  test  partition,  too  few  cases  in  the  training  group 
can  lead  to  the  design  of  a  poor  classifier,  while  too  few  test 
cases  can  lead  to  erroneous  error  estimates.  Leaving-One- 
Out,  and  to  a  lesser  extent  random  resampling,  allow  for 
accurate  estimates  of  error  rates  while  training  on  most 
cases.  For  purposes  of  comparison  of  classifiers  and 
methods,  resampling  provides  an  added  advantage.  Using 
the  same  data,  researchers  can  readily  duplicate  analysis 
conditions  and  compare  published  error  estimates  with  new 
results.  Using  only  a  single  random  train  and  test  partition 
introduces  the  possibility  of  variability  of  partitions  to 
explain  the  divergence  from  a  published  result. 

2.2.  Classification  Methods 

In  this  section,  the  specific  classification  methods  used  in 
the  comparison  will  be  described.  We  do  not  review  the 
methods  or  their  mathematics,  but  rather  state  the  conditions 
under  which  they  were  applied.  References  to  all  methods 
are  readily  available.  Our  goal  is  to  apply  each  of  these 
methods  to  the  same  data  sets  and  report  the  results. 

2.2.1.  Statistical  Pattern  Recognition 

Several  classical  pattern  recognition  methods  were  used. 
Figure  2  lists  these  methods.  These  methods  are  well- 
known  and  will  not  be  discussed  in  detail.  The  reader  is 
referred to [Duda and Hart, 1973] for further details.
Instead,  we  give  the  specific  variation  of  the  method  that  we 
used. 


Linear discriminant
Quadratic discriminant
Nearest neighbor
Bayes independence
Bayes second order

Figure 2: Statistical Pattern Recognition Methods

The linear and quadratic discriminants are the standard
multivariate normal discriminants. The linear discriminant
simplifies the normality assumption to equal covariance
matrices. This is probably the most commonly used form of
discriminant analysis; we used the canned SAS and IMSL
programs. A recent report has demonstrated improved
results in game playing evaluation functions using the
quadratic classifier [Lee, 1988].

We used the nearest neighbor method (k=1) with the
Euclidean distance metric. This is one of the simplest
methods  conceptually,  and  is  commonly  cited  as  a  basis  of 
comparison  with  other  methods.  It  is  often  used  in  case- 
based  reasoning  [Waltz,  1986]. 
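For concreteness, the nearest neighbor rule with k=1 and Euclidean distance can be written in a few lines. The function name and the representation of a case as a (features, class) pair are assumptions of this sketch.

```python
import math
from typing import Sequence, Tuple

Case = Tuple[Sequence[float], str]


def nearest_neighbor_classify(training: Sequence[Case], pattern: Sequence[float]) -> str:
    """Return the class of the training case closest to `pattern`
    under the Euclidean distance metric (k = 1)."""
    nearest = min(training, key=lambda case: math.dist(case[0], pattern))
    return nearest[1]


if __name__ == "__main__":
    stored = [((0.0, 0.0), "a"), ((5.0, 5.0), "b")]
    print(nearest_neighbor_classify(stored, (1.0, 1.0)))
```

Because the classifier is simply the stored training cases, its apparent error rate is zero whenever identical patterns share a class, which is one reason resampled estimates are needed for a fair comparison.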

Bayes rule is the optimal presentation of minimum error
classification. All classification methods can be viewed as
approximations to Bayes optimal classifiers. Because the
Bayes optimal classifier requires complete probability data
for all dependencies in its invocation, for real problems this
would be impossible. As with other methods, simplifying
assumptions are made. The usual simplification is to assume
conditional independence of observations. While one can
point to dozens of classifiers that have been built
(particularly in medical applications [Szolovits and Pauker,
1978]) using Bayes rule with independence, such
approaches have also been recently reported in the AI
literature (although in the context of unsupervised
learning) [Cheeseman, 1988]. Although independence is
commonly assumed, there are mathematical expansions to
incorporate higher order correlations among the
observations. In our experiments, we tried both Bayes with
independence and Bayes with the second order Bahadur
expansion.^6
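A minimal sketch of Bayes rule under the conditional independence assumption, for discrete observations, is shown below. The function name and the add-one smoothing of the probability estimates are assumptions of this example and are not taken from the experiments; the second order Bahadur expansion is not shown.

```python
import math
from collections import Counter, defaultdict
from typing import Callable, Hashable, Sequence, Tuple

Case = Tuple[Sequence[Hashable], str]


def train_bayes_independence(cases: Sequence[Case]) -> Callable[[Sequence[Hashable]], str]:
    """Estimate P(class) and P(observation | class) from counts and return
    a classifier that multiplies them as if observations were independent."""
    class_counts = Counter(label for _, label in cases)
    value_counts = defaultdict(Counter)  # (class, position) -> value counts
    for features, label in cases:
        for pos, value in enumerate(features):
            value_counts[(label, pos)][value] += 1
    n_values = [len({f[pos] for f, _ in cases}) for pos in range(len(cases[0][0]))]

    def classify(features: Sequence[Hashable]) -> str:
        best_label, best_score = None, -math.inf
        for label, count in class_counts.items():
            score = math.log(count / len(cases))  # log prior
            for pos, value in enumerate(features):
                # Add-one smoothing (assumption) avoids zero probabilities.
                numerator = value_counts[(label, pos)][value] + 1
                score += math.log(numerator / (count + n_values[pos]))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return classify
```

Continuous variables would first have to be broken into intervals before counts like these can be taken.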

2.2.2.  Neural  Nets 

A fully connected neural net with a single hidden layer was
considered. The back propagation procedure [McClelland
and Rumelhart, 1988] was employed and the general outline
of the data analysis described in [Gorman, 1988] was
followed. The specific implementation used was
[McClelland and Rumelhart, 1988].^7 In most experiments
a learning rate of 1 and a momentum of 0 were used.^8
Patterns were presented randomly to the learning system.^9

The  analysis  model  of  [Gorman,  1988]  corresponds  to  a 
10-fold  cross  validation.  Unlike  the  other  methods 
examined  in  this  study,  back  propagation  usually 
commences  with  the  network  weights  in  a  random  state. 
Thus,  even  with  sequential  presentation  of  cases,  the 
weights for one learned network are unlikely to match the
same network that starts in a different random state. There
is also the possibility of the procedure reaching a local
maximum. In this analysis model, for each train and test
experiment, the weights are learned 10 times, and test
results averaged over all 10 experiments. Therefore, 10
times the usual number of training trials must be considered.
For a 10-fold cross-validation, 100 learning experiments are
made. 

For  each  data  set,  these  experiments  were  repeated  for 
networks  having  0,2,3,6,9,12,  or  24  hidden  units  (in  a  single 
layer).  This  is  equivalent  to  using  resampling  to  estimate  die 
appropriate  number  of  hidden  units.  Because  the  data  sets 
may  not  be  separable  with  these  numbers  of  hidden  units, 
we  look  the  following  measures  to  determine  a  sufficient 


amount  of  computation  time.  Before  doing  the  train  and  test 
experiments,  the  nets  were  trained  several  times  on  all 
samples  for  all  size  hidden  units.  We  determined  a  number 
of  epochs,  i.e.  complete  presentations  of  the  data  set,  that 
was  sufficient  to  result  in  each  increment  of  additional 
hidden  units  fitting  the  cases  better  than  the  lesser  number 
of  hidden  units.  In  addition,  for  one  problem  where  the  data 
set  was  extremely  large,  we  sampled  the  results  every  500 
epochs,  and  computed  whether  the  average  total  squared 
error  continued  to  be  reduced.  This  indicated  whether 
prtOTess  was  being  made. 

One output unit was used for each class. The hypothesis
with  the  highest  weight  was  selected  as  the  conclusion  of 
the  classifier,  and  the  error  rate  was  computed. 
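The decision step described above, one output unit per class with the strongest output taken as the conclusion, can be sketched as follows. The function names are hypothetical, and the sketch assumes the output activations of an already trained net are available as a list of numbers.

```python
from typing import Sequence


def classify_by_output(activations: Sequence[float], class_names: Sequence[str]) -> str:
    """Select the class whose output unit has the highest value."""
    best = max(range(len(activations)), key=lambda i: activations[i])
    return class_names[best]


def error_rate(predictions: Sequence[str], truths: Sequence[str]) -> float:
    """Fraction of cases whose predicted class differs from the true class."""
    wrong = sum(p != t for p, t in zip(predictions, truths))
    return wrong / len(truths)


if __name__ == "__main__":
    print(classify_by_output([0.1, 0.7, 0.2], ["A", "B", "C"]))
```

In case of a tie, this sketch picks the first of the tied units; the text does not say how ties were handled.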

This  is  the  general  outline  of  the  procedures  followed.  In 
Section  3,  we  describe  the  variations  on  this  theme  that  were 
necessary  for  the  specific  data  set  analyses. 

For  computational  reasons,  in  some  instances  it  was 
necessary  to  reduce  the  number  of  repeated  trials  to  be 
averaged.  For  back  propagation,  we  described  a 
computational  procedure  that  performed  10  train  and  test 
experiments  for  each  one  that  would  be  necessary  for  other 
methods.  However,  the  data  sets  described  in  Section  3  are 
not  readily  separable.  Thus,  the  computation  demands  are 
quite  large.  We  estimate  that  6  months  of  Sun  4/280  cpu 
time  were  expended  to  compute  the  neural  nets  results  in 
Section  3. 

2.2.3. Machine Learning Methods

In  this  category,  we  place  methods  that  produce  logistic 
solutions. As indicated earlier, these methods have been
explored by both the machine learning and statistics
community. These are methods that produce solutions
posed as production rules or decision trees. Conjunction or
disjunction  may  be  used  as  well  as  logical  comparison 
operators  on  continuous  variables  such  as  greater  than  or 
less  than. 

Predictive  Value  Maximization  [Weiss,  Galen,  and 
Tadepalli, 1987] was tried on all data sets. This is a
heuristic search procedure that attempts to find the best
single rule in disjunctive normal form. It can be viewed as a
heuristic approximation to exhaustive search. It is applicable
to problems where a relatively short rule provides a good
solution. For such problems, it should have an advantage in
that many combinations are considered, in contrast to
current decision tree procedures that split nodes without
considering combinations. For more complex problems, a
decision tree procedure is preferable. The appropriate rule
length  or  tree  size  is  determined  by  resampling. 

In  addition,  for  two  of  the  smaller  data  sets,  an  exhaustive 
search  was  performed  for  the  optimal  rule  of  length  2  in 
disjunctive normal form. For the other 2 data sets, the
published decision tree results are available for methods
using variations of ID3 and its successor C4.
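To make the idea of an exhaustive search for a rule of length 2 concrete, the sketch below enumerates every pair of threshold conditions, joined by either conjunction or disjunction, and keeps the pair that classifies the most cases correctly. It is not the search program used in the study: the function names, the midpoint thresholds, and the accuracy criterion are assumptions, and Predictive Value Maximization itself is a heuristic search that is not reproduced here.

```python
import operator
from itertools import combinations
from typing import List, Sequence, Tuple

Condition = Tuple[int, str, float]  # (variable index, ">" or "<", threshold)


def candidate_conditions(cases: Sequence[Sequence[float]]) -> List[Condition]:
    """All greater-than and less-than tests at midpoints of observed values."""
    conditions = []
    for var in range(len(cases[0])):
        values = sorted({case[var] for case in cases})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2.0
            conditions.append((var, ">", threshold))
            conditions.append((var, "<", threshold))
    return conditions


def holds(condition: Condition, case: Sequence[float]) -> bool:
    var, op, threshold = condition
    return case[var] > threshold if op == ">" else case[var] < threshold


def best_rule_of_length_2(cases: Sequence[Sequence[float]], labels: Sequence[bool]):
    """Exhaustively score every pair of conditions under AND and under OR."""
    best_rule, best_correct = None, -1
    for a, b in combinations(candidate_conditions(cases), 2):
        for name, combine in (("and", operator.and_), ("or", operator.or_)):
            correct = sum(
                combine(holds(a, case), holds(b, case)) == label
                for case, label in zip(cases, labels)
            )
            if correct > best_correct:
                best_rule, best_correct = (a, name, b), correct
    return best_rule, best_correct
```

The number of candidate pairs grows quadratically with the number of conditions, which is why such a search is practical only for the smaller data sets and for short rules.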

3  Results 

In  this  section,  we  review  the  results  of  the  various 
classification  methods  on  four  data  sets.  All  of  the  data  sets 
have  been  published,  and  in  most  instances  we  attempted  to 
perform  the  analyses  in  a  manner  consistent  with  previously 
known  results. 

^6 Continuous variables were broken into 10 (binary) intervals with width
of half a standard deviation from the mean.

^7 The program was readily ported to a Sun 4.

^8 These two parameters were changed from the program defaults because
it was observed that the program converged towards a solution much faster,
and no problems were encountered with local maximums.

^9 For the studied data sets, sequential presentation tended to lead rather
quickly to a local maximum.


3.1. Iris Data


The iris data was used by Fisher in his derivation of the
linear discriminant function [Fisher, 1936], and it still is the
standard discriminant analysis example used in most current
statistical routines such as SAS or IMSL. Linear or
quadratic discriminants under assumptions of normality
perform extremely well on this data set. Three classes of
iris are discriminated using 4 continuous features. The data
set consists of 150 cases, 50 for each class. Figure 3
summarizes the results. The first error rate is the apparent
error rate on all cases; the second error rate is the
leaving-one-out error rate. Leaving-one-out results have been
previously widely disseminated for several of the statistical
pattern recognition methods.

Figure 3: Comparative Performance on Fisher's Iris Data

The  rule-based  solution  has  2  rules  with  a  total  of  3 
variables.^10 For the neural nets, the apparent error rate is the
average of five trials. The leaving-one-out result is the
average of 5 complete leaving-one-out trials. The nets were
trained for 1000 epochs. The best neural net in terms of
cross-validated  error  occurs  at  3  hidden  units,  and  is  the  one 
listed  in  Figure  3.  The  relationship  between  the  number  of 
hidden  units  and  the  error  rates  is  listed  in  Figure  4. 


[Plot legend: Apparent, Leaving-one-out]


Figure  4:  Neural  Net  Error  Rates  for  Iris  Data 

3.2.  Appendicitis  Data 

This data set is from a published study on the assessment of
8 laboratory tests to confirm the diagnosis of
appendicitis [Marchand, Van Lente, and Galen, 1983].'*
Following surgery, only 85 of 106 patients were confirmed
by biopsy to have had appendicitis. Thus, the ability to
discriminate the true appendicitis patients by lab tests prior
to surgery would prove extremely valuable.

The  samples  consist  of  106  patients  and  8  diagnostic 
tests.  Because  one  test  had  some  missing  values,  for 
purposes  of  comparison,  we  excluded  results  from  that  test. 
Figure 5 summarizes the results. The first error rate is the
apparent error rate on all cases; the second error rate is the
leaving-one-out error rate.
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Figure  5:  Comparative  Performance  on  Appendicitis  Data 
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Figure  6:  Neural  Net  Error  Rates  for  Appendicitis  Data 

The  rule-based  solution  has  1  rule  with  a  total  of  2 
variables.  For  the  neural  nets,  the  apparent  error  rate  is  the 
average  of  five  trials.  The  leaving-one-out  result  is  for  a 
single leaving-one-out trial. The nets were trained for
15000 epochs. The best neural net in terms of
cross-validated error occurs at 0 hidden units, and is the one listed
in  Figure  5.  The  relationship  between  the  number  of  hidden 
units  and  the  error  rates  is  listed  in  Figure  6. 


optimal  rule  is  also  induced  by  PVM  during  cross-validation. 

"These  are  patients  admitted  to  an  emergency  room  with  a  tentative 
diagnosis  of  acute  appendicitis. 


^The results for the average of 5 complete leaving-one-out trials are
available for 1000 epochs. These show poorer performance, but 1000 epochs
were not sufficient for training the larger number of hidden units.


3.3.  Cancer  Data

A data set for evaluating the prognosis of breast cancer
recurrence was analyzed by Michalski's AQ15 rule
induction program and reported in [Michalski, Mozetic,
Hong, and Lavrac, 1986]. They reported a 64% accuracy
rate for expert physicians, a 68% rate for AQ15, and a
72% rate for the pruned tree procedure of
ASSISTANT [Kononenko, Bratko, and Roskar, 1986], a
descendant of ID3.^^ The authors derived the error rates by
randomly resampling 4 times using a 70% train and a 30%
test partition.

The data consist of 286 samples, 9 tests, and 2
classes. We created 4 randomly sampled data sets with 70%
train and 30% test partitions; each method was tried on
each of the four data sets and the results averaged. Thus, the
experimental results are consistent with the original study.
Figure 7 summarizes the results. The first error rate is the
apparent error rate on the training cases; the second error
rate is the error rate on the test cases.
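The resampling procedure described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names, the fixed random seed, and the majority-class trainer used as a stand-in classifier are all assumptions.

```python
import random
from collections import Counter

def majority_class(train):
    """Hypothetical stand-in classifier: always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def error_rate(predict, cases):
    """Fraction of cases whose predicted label differs from the stored label."""
    return sum(predict(x) != y for x, y in cases) / len(cases)

def repeated_holdout(cases, trainer, trials=4, train_fraction=0.7, seed=0):
    """Average the training and test error over several random train/test partitions."""
    rng = random.Random(seed)  # seed is an assumption, used only for repeatability
    n_train = round(train_fraction * len(cases))
    train_errors, test_errors = [], []
    for _ in range(trials):
        shuffled = list(cases)
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        predict = trainer(train)
        train_errors.append(error_rate(predict, train))
        test_errors.append(error_rate(predict, test))
    return sum(train_errors) / trials, sum(test_errors) / trials
```

With 286 samples and the default settings, each trial trains on 200 cases and tests on the remaining 86.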
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Figure  7:  Comparative  Performance  on  Cancer  Data

The rule-based solution has 1 rule with a total of 2
variables.*^ For the neural nets, the apparent error rate is
the average of ten training trials. Each testing result is the
corresponding average testing result of the same 10
complete trials.*^ The nets were trained for 2000 epochs.
The best neural net in terms of cross-validated error occurs
at 0 hidden units, and is the one listed in Figure 7. The
relationship between the number of hidden units and the
error rates is listed in Figure 8.

3.4.  Thyroid  Data 

Quinlan  reported  on  results  of  his  analysis  of  hypothyroid 
data  in  [Quinlan,  1987b],  and  in  greater  detail  in  [Quinlan, 
1987a].  The  problem  is  to  determine  whether  a  patient 
referred  to  the  clinic  is  hypothyroid,  the  most  common 
thyroid  problem.  In  contrast  to  the  previous  applications, 
relatively  large  numbers  of  samples  are  available. 

The  samples  consist  of  3772  cases  from  the  year  1985. 
These  are  the  same  cases  used  in  the  original  report  and 
were  used  for  training.  The  3428  cases  from  1986  were  used 
as  test  cases.  There  are  22  (principal)  tests,  and  3  classes. 
Over 10% of the values are missing because some lab tests
were  deemed  unnecessary.  For  purposes  of  comparison  of 


'^The prevalence of the larger class is 70%.

'*The same rule was induced on all four 70% training sets.

'^Also  considered  was  the  best  of  the  10  training  results  and  its 
corresponding  test  result.  These  results  are  within  1%  of  the  average 
results. 




Figure  8:  Neural  Net  Error  Rates  for  Cancer  Data 

the  methods,  these  values  were  filled  in  with  the  mean  value 
for  the  corresponding  class. 
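Filling in missing values with the class mean can be sketched as below. This is an illustrative sketch only: the function name and the use of `None` to mark a missing value are assumptions, and the code expects at least one observed value for every class and feature pair.

```python
from collections import defaultdict

def fill_with_class_means(rows, labels):
    """Replace None entries with the mean of that feature over cases of the same class."""
    sums = defaultdict(float)   # (label, column) -> sum of observed values
    counts = defaultdict(int)   # (label, column) -> number of observed values
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            if value is not None:
                sums[label, j] += value
                counts[label, j] += 1
    filled = []
    for row, label in zip(rows, labels):
        # Raises ZeroDivisionError if a class has no observed value for a needed feature.
        filled.append([
            value if value is not None else sums[label, j] / counts[label, j]
            for j, value in enumerate(row)
        ])
    return filled

rows = [[1.0, None], [3.0, 4.0], [5.0, 10.0], [None, 20.0]]
labels = ["x", "x", "y", "y"]
print(fill_with_class_means(rows, labels))
```

Note that this scheme needs the class label of every case, so it serves the purpose of comparing methods on complete data rather than classifying an unlabeled patient.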

Figure 9 summarizes the results.*^ The first error rate is
the  error  rate  on  the  3772  training  cases;  the  second  error 
rate  is  the  error  rate  on  the  3428  test  cases.  From  a  medical 
perspective,  it  is  known  that  (based  on  lab  tests)  excellent 
classification  can  be  achieved  for  diagnosing  thyroid 
dysfunction.  For  these  data,  the  correct  answer  stored  with 
each  sample  is  derived  from  a  large  rule-based  system  in  use 
in  Australia.  While  most  error  rates  in  Figure  9  are  low,  it  is 
important  to  note  that  1%  of  the  total  sample  represents  over 
70  people.  Over  92%  of  the  samples  are  not  hypothyroid. 
Therefore,  any  acceptable  classifier  must  do  significantly 
better  than  92%. 
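The arithmetic behind these two remarks can be checked directly. The short Python sketch below uses only the case counts given above; the variable names are illustrative, and the 92% figure is treated as a lower bound on the share of non-hypothyroid samples.

```python
train_cases, test_cases = 3772, 3428

total = train_cases + test_cases     # all cases from 1985 and 1986
one_percent = total / 100            # number of people represented by 1% of the sample

# A classifier that always answers "not hypothyroid" is right on over 92% of
# the samples, so it makes at most about 8% errors on the test cases.
baseline_accuracy = 0.92
baseline_test_errors = (1 - baseline_accuracy) * test_cases

print(total, one_percent, round(baseline_test_errors))
```

Any classifier worth using therefore has to make far fewer than roughly 274 errors on the 3428 test cases.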
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Figure  9:  Comparative  Performance  on  Thyroid  Data

The  rule-based  solution  has  2  rules  with  a  total  of  8 
variables.  For  the  neural  nets,  the  apparent  error  rate  is  the 
best  of  2  trials.  The  nets  were  trained  for  2000  epochs.  The 
best  neural  net  in  terms  of  testing  error  occurs  at  3  hidden 
units.  The  relationship  between  the  number  of  hidden  units 
and  the  error  rates  is  listed  in  Figure  10. 


’*The C4 tree cited in the original study has a training error rate of .0021
and a testing error rate of .0085. However, the training data contained
missing values.


Figure  10:  Neural  Net  Error  Rates  for  Thyroid  Data 

The cpu times for training a neural net with back
propagation on a data set of this size were great: for 3 hidden
units, 500 epochs required 1.5 hours of Sun 4/280 cpu time,
while 24 units required 11.5 hours. In Figure 10, the
apparent error rates for the larger numbers of hidden units
support the hypothesis that additional training was
necessary. We initiated a new set of experiments with
smaller numbers of hidden units.*^ We let these trials run for an
unlimited period of time as long as slight progress was
being made, as indicated by sampling every 500 epochs.
Therefore, for this experiment not every size neural net was
run for an equal number of epochs. Figure 11 summarizes the
results of this effort. The best result encountered during the
sampling of results occurred for 3 hidden units, and this
result is listed in Figure 9.
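The extended training schedule can be pictured with the following minimal Python sketch. It is an assumption-laden illustration rather than the procedure actually run: `train_block` and `error` are hypothetical callables supplied by the user, `min_gain` is an assumed threshold for "slight progress", and `max_samples` is a safety cap that the original experiments did not have.

```python
def train_while_improving(train_block, error, epochs_per_sample=500,
                          min_gain=1e-4, max_samples=1000):
    """Train in blocks of epochs, sampling the error after each block,
    and stop once the sampled error no longer improves by at least min_gain."""
    best = error()
    history = [(0, best)]                 # (total epochs so far, sampled error)
    for k in range(1, max_samples + 1):
        train_block(epochs_per_sample)    # run another block of training epochs
        current = error()
        history.append((k * epochs_per_sample, current))
        if best - current < min_gain:     # no measurable progress: stop
            break
        best = current
    return history
```

Because the stopping point depends on the progress observed, nets of different sizes end up trained for different numbers of epochs, as noted above.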
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Figure  11:  Extended  Neural  Network  Training  on  Thyroid  Data

4.  Discussion

The applications presented here represent a reasonable cross
section of prototypical problems widely encountered in
many research communities. Each problem has few classes
and is characterized by uncertainty of classification. In some
applications, such as the cancer data, the features are
relatively weak and good predictive capabilities are
unlikely. In others, such as the thyroid data, the features are
quite strong, and almost error-free prediction is possible.

For  the  smaller  data  sets,  resampling  was  used.  With  over 
100  cases,  resampling  techniques  such  as  cross-validation 
should  give  excellent  estimates  for  the  true  error  rate.  In 


’’The momentum was changed to .9, and the learning rate to 5 to help
prevent local maximums.


fact, the data from the iris study has been reviewed over
many years, and comparisons have been made on the basis
of the leaving-one-out error. It is interesting to note (for
those who wish to avoid concepts such as multivariate
distributions and covariance matrices) that a trivial set of 2
rules with a total of 3 variables can produce equal results.

For many application fields, this in fact is a major
advantage of the logistic approaches, i.e. the rule-based or
decision tree based approaches. The solution is compatible
with elementary human reasoning and explanations. It is
also compatible with rule-based systems. Thus, if
everything were equal, many would choose the logistic
solution.

In our experiments, everything was not equal. In every
case a logistic solution was found that exceeded the
performance of solutions posed using different underlying
models. PVM has an advantage when a short rule works,
but for more complex problems the decision tree would be
indicated. We note that the largest problem studied, the
thyroid application, is somewhat biased towards logistic
solutions. The endpoints were derived from a rule-based
system that apparently uses the same lab test thresholds to
specify high or low readings for all hypotheses.

These results cannot necessarily be extrapolated to more
complex problems. However, our experience is not unique.
Numerous experiments by the developers of
CART [Breiman, Friedman, Olshen, and Stone, 1984]
demonstrated that in most instances a tree was
superior to alternative statistical classification techniques.

In our experiments, the statistical classifiers performed
consistently with expectations. The linear classifiers (with
the assumption of a normal distribution) gave good
performance in all cases except the thyroid experiment.
These classifiers are widely used, because they are simple
and the training error rate usually holds up well on test
cases. The natural extension, the quadratic classifier, fits
better to normally distributed data, but degrades rapidly
with nonnormal data. It did poorly in most of our
experiments. Similarly, Bayes with independence does
moderately well, but the 2nd order fits were not good on the
test data. Nearest neighbor does well with good features,
but tends to degrade with many poor features. There are
many alternative statistical classifiers that might be tried,
such as nonparametric piecewise linear classifiers [Foroutan
and Sklansky, 1985]. In addition, one could try to reduce
the number of features for training (i.e. feature selection),
since many of these methods can actually improve
performance on test cases by feature reduction.**

The neural nets did perform well, and they were the only
statistical classifiers to do well on the thyroid problem.
However, overall they were not the best classifiers; they
consumed enormous amounts of cpu time; and they were
sometimes equaled by simple classifiers. Research on
improving performance for neural net training and
representation is quite active, so it may be possible that
performance can be improved.

The relationship between the number of hidden units and
the two error rates followed the classical pattern for
classifiers. As the number of hidden units increased, the
apparent error decreased.*'^ However, at some point, as the
classifier overfits the data, the true error rate curve flattens
and even begins to increase. Much the same behavior can
be observed for decision trees as the number of nodes
increases, or production rules, as the rule length increases.
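The pattern described above suggests choosing the classifier size by the estimated true error rather than the apparent error. A minimal Python sketch follows; the function name and the tie-breaking preference for the smaller net are assumptions, and the error rates are made-up numbers for illustration only, not results from this study.

```python
def best_size(error_rates):
    """error_rates maps number of hidden units -> (apparent error, estimated true error).
    Return the size with the lowest estimated true error; ties go to the smaller net."""
    return min(error_rates, key=lambda h: (error_rates[h][1], h))

# Made-up numbers: apparent error keeps falling as the net grows, while the
# estimated true error flattens and then rises as the net overfits.
illustrative = {
    0: (0.10, 0.12),
    2: (0.06, 0.09),
    4: (0.03, 0.09),
    8: (0.01, 0.13),
}
print(best_size(illustrative))
```

Selecting by apparent error alone would always favor the largest net in this illustration, which is exactly the overfitting case.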


‘*Because the linear classifier performed poorly on the thyroid cases, we
tried to train a classifier on just the lab tests, which are the most significant
tests. The results did not improve.

’Occasionally there is some slight variability in the decrease of the
apparent error rate because back propagation minimizes distance as
opposed to errors.


The question remains open as to how well any classifier
can do on more complex problems with many more features
and many more classes, possibly non-mutually exclusive
classes. There are also questions of how many cases are
actually needed to learn significant concepts. Our study
does not answer many of these questions, but helps show in
a limited fashion where we are currently with many
commonly used classification techniques.

Appendix:  Induced  Rules 

•  iris. Petal length < 3 → Iris Setosa; Petal
length > 4.9 OR Petal width > 1.6 → Iris
Virginica

•  appendicitis. MNEA>6600 OR MBAP>11

•  cancer. Involved Nodes>0 & Degree=3

•  thyroid. TSH>6.1 & FTI<65 → primary
hypothyroid; TSH>6 & TT4<149 & On
Thyroxin=false & FTI>64 & Surgery=false
→ compensated hypothyroid
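To show how compact such a solution is, the iris rules can be written as a few lines of Python. This is a sketch under stated assumptions: the function name is hypothetical, the rules are applied in the order listed, and a case matching neither rule is assigned to the remaining class, Iris Versicolor, which the appendix does not state explicitly.

```python
def classify_iris(petal_length, petal_width):
    """Apply the two induced iris rules in order (ordering is an assumption)."""
    if petal_length < 3:
        return "Iris Setosa"
    if petal_length > 4.9 or petal_width > 1.6:
        return "Iris Virginica"
    # Assumed default: the third class, when neither rule fires.
    return "Iris Versicolor"

print(classify_iris(1.4, 0.2))
print(classify_iris(4.5, 1.5))
print(classify_iris(5.5, 2.0))
```

Only petal length and petal width are consulted, with petal length appearing in both rules, which accounts for the 2 rules and 3 variables counted in the text.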
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